{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5abe2121-5381-46d7-a849-66f921883972",
   "metadata": {},
   "source": [
    "# LangChain 核心模块：Data Conneciton - Text Embedding Models\n",
    "\n",
    "Embeddings类是一个专门用于与文本嵌入模型进行交互的类。有许多嵌入模型提供者（OpenAI、Cohere、Hugging Face等）-这个类旨在为所有这些提供者提供一个标准接口。\n",
    "\n",
    "嵌入将一段文本创建成向量表示。这非常有用，因为它意味着我们可以在向量空间中思考文本，并且可以执行语义搜索等操作，在向量空间中寻找最相似的文本片段。\n",
    "\n",
    "LangChain中基础的Embeddings类公开了两种方法：一种用于对文档进行嵌入，另一种用于对查询进行嵌入。前者输入多个文本，而后者输入单个文本。之所以将它们作为两个独立的方法，是因为某些嵌入提供者针对要搜索的文件和查询（搜索查询本身）具有不同的嵌入方法。\n",
    "\n",
    "\n",
    "![](https://python.langchain.com/assets/images/data_connection-c42d68c3d092b85f50d08d4cc171fc25.jpg)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f994c4f8-58cf-4d34-b58f-205b42535177",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "df8d7408-b14f-4cfb-84c0-9c0bae958cce",
   "metadata": {
    "jupyter": {
     "source_hidden": true
    }
   },
   "source": [
    "### Document 类\n",
    "\n",
    "这段代码定义了一个名为`Document`的类，允许用户与文档的内容进行交互，可以查看文档的段落、摘要，以及使用查找功能来查询文档中的特定字符串。\n",
    "\n",
    "```python\n",
    "# 基于BaseModel定义的文档类。\n",
    "class Document(BaseModel):\n",
    "    \"\"\"接口，用于与文档进行交互。\"\"\"\n",
    "\n",
    "    # 文档的主要内容。\n",
    "    page_content: str\n",
    "    # 用于查找的字符串。\n",
    "    lookup_str: str = \"\"\n",
    "    # 查找的索引，初次默认为0。\n",
    "    lookup_index = 0\n",
    "    # 用于存储任何与文档相关的元数据。\n",
    "    metadata: dict = Field(default_factory=dict)\n",
    "\n",
    "    @property\n",
    "    def paragraphs(self) -> List[str]:\n",
    "        \"\"\"页面的段落列表。\"\"\"\n",
    "        # 使用\"\\n\\n\"将内容分割为多个段落。\n",
    "        return self.page_content.split(\"\\n\\n\")\n",
    "\n",
    "    @property\n",
    "    def summary(self) -> str:\n",
    "        \"\"\"页面的摘要（即第一段）。\"\"\"\n",
    "        # 返回第一个段落作为摘要。\n",
    "        return self.paragraphs[0]\n",
    "\n",
    "    # 这个方法模仿命令行中的查找功能。\n",
    "    def lookup(self, string: str) -> str:\n",
    "        \"\"\"在页面中查找一个词，模仿cmd-F功能。\"\"\"\n",
    "        # 如果输入的字符串与当前的查找字符串不同，则重置查找字符串和索引。\n",
    "        if string.lower() != self.lookup_str:\n",
    "            self.lookup_str = string.lower()\n",
    "            self.lookup_index = 0\n",
    "        else:\n",
    "            # 如果输入的字符串与当前的查找字符串相同，则查找索引加1。\n",
    "            self.lookup_index += 1\n",
    "        # 找出所有包含查找字符串的段落。\n",
    "        lookups = [p for p in self.paragraphs if self.lookup_str in p.lower()]\n",
    "        # 根据查找结果返回相应的信息。\n",
    "        if len(lookups) == 0:\n",
    "            return \"No Results\"\n",
    "        elif self.lookup_index >= len(lookups):\n",
    "            return \"No More Results\"\n",
    "        else:\n",
    "            result_prefix = f\"(Result {self.lookup_index + 1}/{len(lookups)})\"\n",
    "            return f\"{result_prefix} {lookups[self.lookup_index]}\"\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b68fdbcb-b60d-441f-91fc-d8cac24ba3e1",
   "metadata": {},
   "source": [
    "## 使用 OpenAIEmbeddings 调用 OpenAI 嵌入模型\n",
    "\n",
    "\n",
    "### 使用 embed_documents 方法嵌入文本列表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8dadd89b-6a13-4391-9102-acde028b61d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.embeddings import OpenAIEmbeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "24f9c721-dfd3-4632-a89e-92d2fa9b3594",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings_model = OpenAIEmbeddings()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "67076c7d-54cc-47a9-a5bd-85570355a7d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = embeddings_model.embed_documents(\n",
    "    [\n",
    "        \"Hi there!\",\n",
    "        \"Oh, hello!\",\n",
    "        \"What's your name?\",\n",
    "        \"My friends call me World\",\n",
    "        \"Hello World!\"\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "79e3c50a-efad-49bb-84fc-f0fb2585b34e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(embeddings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9e85d4c4-ee77-4470-85ec-ce37f359fdc7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1536"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(embeddings[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "32e3ddc2-9d69-4b72-ace1-800ba94def79",
   "metadata": {},
   "source": [
    "### 使用 embed_query 方法嵌入问题"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0069d18d-1fb6-40e2-974c-2f49559b8b9a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# QA场景：嵌入一段文本，以便与其他嵌入进行比较。\n",
    "embedded_query = embeddings_model.embed_query(\"What was the name mentioned in the conversation?\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "2c3068f7-22b7-4215-99c0-7e47f9a3aa46",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1536"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(embedded_query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25339b8d-ee23-460c-ae4e-97a8f24f6add",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
